Seth Metcalfe PSY6422 Project

Before we begin, please make sure you have all the necessary packages installed and loaded according to the list in the README and as outlined in the code below.

#load these libraries
library(here)
library(tidyverse)
library(ggplot2)
library(naniar)
library(sf)
library(ggiraph)
library(svglite)
library(base64enc)

The Data Origins

The Alcohol Profile Data

For this project, I used a data set from the UK Department of Health and Social Care called “Alcohol Profile 2025.” You can access this data set here. To download the correct data set, scroll down on the linked webpage to the “Get the data” section. There are three subheadings, the first of which is “Profile: Alcohol Profile.” Under that are two hyper-linked data sets. The correct data set to download is the first one, which is called “Data for Government Office Region (E12).” Because I am using the “here” package, I saved the downloaded data to my project in a new subfolder called “Data.”

Note: there are three hyper-links called “Data for Government Office Region (E12).” We only need the one from the “Profile: Alcohol Profile” heading.

Note: if you are accessing this data organically from a web search rather than through the provided link, be sure to change the “Area type” tab to “Regions (Statistical)” to ensure you download the data set that includes statistics across the regions of England.

Loading the Data

We will be loading the using the “here” package. Make sure you have it downloaded and loaded for this project. I saved my raw data to a subfolder called “Data.” If you have done something different, edit the following code to match your chosen directory.

#loading in the data set and assigning it the name "raw" and making it a tibble
raw <- read_csv(here("Data", "indicators-Regionsstatistical.data.csv"))
raw_tib <- as_tibble(raw)

After loading the data, I immediately transformed the data into a tibble to make the following tidying and filtering of the data easier for R Studio to read.

Viewing and Understanding the Data

The data was collected by the Department of Health and Social Care as part of an initiative by the Office for Health Improvement and Disparities to track different alcohol-related indicators in the United Kingdom. There are 24 different indicators tracked in this profile, including such things as alcohol-related mortality, under 75 mortality rate from alcoholic liver disease, and admission episodes for alcohol-specific conditions. This data has been collected from public health services such as hospitals and health centres across the UK. For this project, we are specifically looking at the indicator of “admission episodes for alcohol-specific conditions.” Let’s take a look at the beginning of the raw data.

#getting a first look at the raw data
head(raw_tib)
# A tibble: 6 × 27
  `Indicator ID` `Indicator Name`        `Parent Code` `Parent Name` `Area Code`
           <dbl> <chr>                   <chr>         <chr>         <chr>      
1          90861 Under 75 mortality rat… <NA>          <NA>          E92000001  
2          90861 Under 75 mortality rat… <NA>          <NA>          E92000001  
3          90861 Under 75 mortality rat… <NA>          <NA>          E92000001  
4          90861 Under 75 mortality rat… E92000001     England       E92000001  
5          90861 Under 75 mortality rat… E92000001     England       E92000001  
6          90861 Under 75 mortality rat… E92000001     England       E92000001  
# ℹ 22 more variables: `Area Name` <chr>, `Area Type` <chr>, Sex <chr>,
#   Age <chr>, `Category Type` <chr>, Category <chr>, `Time period` <chr>,
#   Value <dbl>, `Lower CI 95.0 limit` <dbl>, `Upper CI 95.0 limit` <dbl>,
#   `Lower CI 99.8 limit` <dbl>, `Upper CI 99.8 limit` <dbl>, Count <dbl>,
#   Denominator <dbl>, `Value note` <chr>, `Recent Trend` <chr>,
#   `Compared to England value or percentiles` <chr>,
#   `Compared to percentiles` <chr>, `Time period Sortable` <dbl>, …

The data has 27 columns of variables, most of which are not pertinent for creating my visualization. Immediately, you can see indicator data including the names of the indicators and their respective ID numbers. We can also see region data under the “Area Name” and “Area Code” columns. There is also information for age and sex of patients. Probably the most important variable is the “Value” column, which gives information pertinent to each indicator. In the case of my project, the Value variable is telling us how many Alcohol-Specific Admissions occurred per 100,000 people in the given years and regions.

The Shapefile Data to Build the Map

In order to build the map itself, we need a shapefile of England’s nine regions (which include the North East, North West, Yorkshire and the Humber, East Midlands, West Midlands, East of England, London, South East, and South West). I obtained this shapefile from the Open Geography Portal by the Office of National Statistics. The shapefile called “Regions (December 2023) Boundaries EN BGC.” You can download the shapefile here. On this linked webpage, hit download on the left-hand side, and when you do, ten download options will be available. Choose the second option called “Shapefile.” Because I am using the “here” package, I saved the downloaded data to my project in the “Data” subfolder, like before.

Note: this download will need to be unzipped in order to access it.

Load the Shapefile using the code below.

#loading the map data
map <- read_sf(here("Data", "Regions_December_2023_Boundaries_EN_BGC_7009286823150963379"))

Research Question

My visualization is attempting to answer the following question: how have the numbers of alcohol-specific hospital admissions changed in each region of England since the year 2016? I suppose this is a very specific question, but hopefully through this visualization, it will be clear if the different regions of England are getting better or worse when it comes to their problematic alcohol use. I will urge viewers not to misunderstand or misrepresent the data. This is a measure of change, not a comparative measure of admissions in total. You might notice that the North West region is getting better, while the West Midlands is getting worse. However, in total, the North West region has roughly 150 more admissions per 100,000 than the West Midlands, which means that their drinking behavior is more problematic than the West Midlands according to total admissions. The graph is showcasing positive and negative trends in alcohol consumption, not comparing which regions are healthier (in a manner of speaking) than others.

Data Preparation

After loading all the data, we need to clean it! To start, I checked if there are any duplicated items in the data with the following code. Thankfully, none were detected.

#checking for duplicated data, no duplications were found
duplicated(raw_tib)

After that, I ran a few pieces of code to make the variable names cleaner. Primarily, that meant removing spaces in the variables of interest and their respective column data observations. The following code replaces any spaces with underscores for four of the variables of interest and all observations.

#renaming relevant variables to remove spaces
names(raw_tib)[names(raw_tib) == "Indicator ID"] <- "Indicator_ID"
names(raw_tib)[names(raw_tib) == "Area Name"] <- "Area_Name"
names(raw_tib)[names(raw_tib) == "Time period"] <- "Time_Period"
names(raw_tib)[names(raw_tib) == "Area Code"] <- "Area_Code"

#replacing spaces in the data with underscores
raw_cleaned <- raw_tib %>%
  mutate_if(is.character, str_replace_all, ' ', '_')

Next, I went through and selected only the variables and observations from the raw data that I needed for my visualization. I called this new data frame “df”, and those variables included Indicator_ID, Area_Name, Time_Period, and Value. Within them, I needed to make sure I was only getting region specific information, so I filtered out the area code for England as a whole. I also filtered out all the indicators that I wasn’t using and made sure that we were using sex data from all persons, not just males or females.

#filtering and selecting the relevant data
df <- raw_cleaned %>% 
  filter(Indicator_ID == 92906) %>% 
  filter(Area_Code != "E92000001") %>% 
  filter(Sex == "Persons") %>% 
  select(Indicator_ID, Area_Name, Time_Period, Value)

Using the nanair package, I also removed any missing pieces of data.

#removing missing data
df <- drop_na(df)

The last two steps in cleaning the overall data were to round the observations in the Value column to the nearest whole number at to filter for only the correct indicator. I rounded the variables because the Department of Health and Social Care did the same when they created their own visualizations of their data. Why should I decide to do otherwise? The filtering step may seem unnecessary, but I have kept it primarily as a renaming function which was necessary for my following code. (Originally, I was using two indicators so two names were needed. As changes were made, I kept the naming conventions to reduce the risk of personal mistakes.)

#rounding values to nearest whole number 
#(I made this decision only because the original data source did the same)
df <- df %>% 
  mutate_if(is.numeric, round)

#filtering further to separate 92906 data from the rest of the tibble
df_all <- df %>% 
  filter(Indicator_ID == 92906)

Filtering, Filtering, Filtering

The last step of cleaning the data was creating 9 different data frames from the combined data from for the 9 regions. This is a bit arduous, but it ended up being necessary in order for me to create disparately colored regions in the final visualization.

#list of filters for the region types of all admission
df_all_NE <- df_all %>% 
  filter(Area_Name == "North_East_region_(statistical)")
df_all_NW <- df_all %>% 
  filter(Area_Name == "North_West_region_(statistical)")
df_all_York <- df_all %>% 
  filter(Area_Name == "Yorkshire_and_the_Humber_region_(statistical)")
df_all_EM <- df_all %>% 
  filter(Area_Name == "East_Midlands_region_(statistical)")
df_all_WM <- df_all %>% 
  filter(Area_Name == "West_Midlands_region_(statistical)")
df_all_EE <- df_all %>% 
  filter(Area_Name == "East_of_England_region_(statistical)")
df_all_Lon <- df_all %>% 
  filter(Area_Name == "London_region_(statistical)")
df_all_SE <- df_all %>% 
  filter(Area_Name == "South_East_region_(statistical)")
df_all_SW <- df_all %>% 
  filter(Area_Name == "South_West_region_(statistical)")

Similarly, I had to separate out the data from the Shapefile for the 9 regions.

#filtering regions from the map
m_NE <- map %>% 
  filter(RGN23CD == "E12000001")
m_NW <- map %>% 
  filter(RGN23CD == "E12000002")
m_York <- map %>% 
  filter(RGN23CD == "E12000003")
m_EM <- map %>% 
  filter(RGN23CD == "E12000004")
m_WM <- map %>% 
  filter(RGN23CD == "E12000005")
m_EE <- map %>% 
  filter(RGN23CD == "E12000006")
m_Lon <- map %>% 
  filter(RGN23CD == "E12000007")
m_SE <- map %>% 
  filter(RGN23CD == "E12000008")
m_SW <- map %>% 
  filter(RGN23CD == "E12000009")

Before we start the visualization creation, let’s take a look at the combined data frame:

#a glimpse of the combined df
head(df_all)
# A tibble: 6 × 4
  Indicator_ID Area_Name                                     Time_Period Value
         <dbl> <chr>                                         <chr>       <dbl>
1        92906 North_East_region_(statistical)               2016/17       770
2        92906 North_West_region_(statistical)               2016/17       832
3        92906 Yorkshire_and_the_Humber_region_(statistical) 2016/17       589
4        92906 East_Midlands_region_(statistical)            2016/17       484
5        92906 West_Midlands_region_(statistical)            2016/17       536
6        92906 East_of_England_region_(statistical)          2016/17       407

The Visualization

Before building the final visualization, I first needed to create the subplots that would appear when you hover over the regions of the map. Before doing that though, I needed to build a save location to put the subplots in order to save them as tooltip items. I did this with the following code.

#The goal of this section is to set a file path to save the subplots in a tooltip
#to pop them up when we hover over them in the final map plot
plot_to_svg_base64 <- function(plot_obj, width = 4, height = 3) { 
  # Create a temporary file path
  tmp_file <- tempfile(fileext = ".svg")
  
  # Save the plot object to the temporary SVG file
  ggsave(
    tmp_file, 
    plot = plot_obj, 
    device = svglite, 
    width = width, 
    height = height, 
    units = "in"
  )
  
  # Read the SVG content
  svg_content <- readChar(tmp_file, file.info(tmp_file)$size)
  
  # Clean up the temporary file
  unlink(tmp_file)
  
  # Encode the SVG content to Base64 to embed directly in HTML tooltip
  base64_string <- base64encode(charToRaw(svg_content))
  
  # Return the full data URI
  return(paste0("data:image/svg+xml;base64,", base64_string))
}

Once that was set, I could build 9 subplots for the 9 regions using ggplot2:

#plot of the time series of all admissions in the North East Region
ts_all_NE <- df_all_NE %>% 
  ggplot(aes(x = Time_Period, y = Value)) +
  geom_point(color = "black") +
  geom_line(aes(group = "Area_Name"), color = "darkred") +
  ylim(600, 1000) +
  labs(title = "Admissions in the North East", 
       x = "Years", 
       y = "Number of Admissions (per 100,000)") +
  theme_classic() +
  theme(plot.title = element_text(size = 10, face = "bold"), 
        axis.title = element_text(size = 8), 
        axis.text = element_text(size = 7, angle = 45, hjust = 1, vjust = 1))
ne_plot_svg <- plot_to_svg_base64(ts_all_NE, width = 3, height = 2.5)

#plot of the time series of all admissions in the North West Region
ts_all_NW <- df_all_NW %>% 
  ggplot(aes(x = Time_Period, y = Value)) +
  geom_point(color = "black") +
  geom_line(aes(group = "Area_Name"), color = "skyblue") +
  ylim(700, 900) +
  labs(title = "Admissions in the North West", 
       x = "Years", 
       y = "Number of Admissions (per 100,000)") +
  theme_classic() +
  theme(plot.title = element_text(size = 10, face = "bold"), 
        axis.title = element_text(size = 8), 
        axis.text = element_text(size = 7, angle = 45, hjust = 1, vjust = 1))
nw_plot_svg <- plot_to_svg_base64(ts_all_NW, width = 3, height = 2.5)

#plot of the time series of all admissions in the Yorkshire Region
ts_all_York <- df_all_York %>% 
  ggplot(aes(x = Time_Period, y = Value)) +
  geom_point(color = "black") +
  geom_line(aes(group = "Area_Name"), color = "orange") +
  ylim(400, 800) +
  labs(title = "Admissions in Yorkshire \nand the Humber", 
       x = "Years", 
       y = "Number of Admissions (per 100,000)") +
  theme_classic() +
theme(plot.title = element_text(size = 10, face = "bold"), 
      axis.title = element_text(size = 8), 
      axis.text = element_text(size = 7, angle = 45, hjust = 1, vjust = 1))
york_plot_svg <- plot_to_svg_base64(ts_all_York, width = 3, height = 2.5)

#plot of the time series of all admissions in the East Midland Region
ts_all_EM <- df_all_EM %>% 
  ggplot(aes(x = Time_Period, y = Value)) +
  geom_point(color = "black") +
  geom_line(aes(group = "Area_Name"), color = "orange") +
  ylim(400, 700) +
  labs(title = "Admissions in the East Midlands", 
       x = "Years", 
       y = "Number of Admissions (per 100,000)") +
  theme_classic() +
  theme(plot.title = element_text(size = 10, face = "bold"), 
        axis.title = element_text(size = 8), 
        axis.text = element_text(size = 7, angle = 45, hjust = 1, vjust = 1))
em_plot_svg <- plot_to_svg_base64(ts_all_EM, width = 3, height = 2.5)

#plot of the time series of all admissions in the West Midland Region
ts_all_WM <- df_all_WM %>% 
  ggplot(aes(x = Time_Period, y = Value)) +
  geom_point(color = "black") +
  geom_line(aes(group = "Area_Name"), color = "darkred") +
  ylim(400, 700) +
  labs(title = "Admissions in the West Midlands", 
       x = "Years", 
       y = "Number of Admissions (per 100,000)") +
  theme_classic() +
  theme(plot.title = element_text(size = 10, face = "bold"), 
        axis.title = element_text(size = 8), 
        axis.text = element_text(size = 7, angle = 45, hjust = 1, vjust = 1))
wm_plot_svg <- plot_to_svg_base64(ts_all_WM, width = 3, height = 2.5)

#plot of the time series of all admissions in the East of England Region
ts_all_EE <- df_all_EE %>% 
  ggplot(aes(x = Time_Period, y = Value)) +
  geom_point(color = "black") +
  geom_line(aes(group = "Area_Name"), color = "orange") +
  ylim(300, 600) +
  labs(title = "Admissions in the East of England", 
       x = "Years", 
       y = "Number of Admissions (per 100,000)") +
  theme_classic() +
  theme(plot.title = element_text(size = 10, face = "bold"), 
        axis.title = element_text(size = 8), 
        axis.text = element_text(size = 7, angle = 45, hjust = 1, vjust = 1))
ee_plot_svg <- plot_to_svg_base64(ts_all_EE, width = 3, height = 2.5)

#plot of the time series of all admissions in the London Region
ts_all_Lon <- df_all_Lon %>% 
  ggplot(aes(x = Time_Period, y = Value)) +
  geom_point(color = "black") +
  geom_line(aes(group = "Area_Name"), color = "orange") +
  ylim(400, 700) +
  labs(title = "Admissions in the London Region", 
       x = "Years", 
       y = "Number of Admissions (per 100,000)") +
  theme_classic() +
  theme(plot.title = element_text(size = 10, face = "bold"), 
        axis.title = element_text(size = 8), 
        axis.text = element_text(size = 7, angle = 45, hjust = 1, vjust = 1))
lon_plot_svg <- plot_to_svg_base64(ts_all_Lon, width = 3, height = 2.5)

#plot of the time series of all admissions in the South East Region
ts_all_SE <- df_all_SE %>% 
  ggplot(aes(x = Time_Period, y = Value)) +
  geom_point(color = "black") +
  geom_line(aes(group = "Area_Name"), color = "skyblue") +
  labs(title = "Admissions in the South East", 
       x = "Years", 
       y = "Number of Admissions (per 100,000)") +
  ylim(100, 800) +
  theme_classic() +
  theme(plot.title = element_text(size = 10, face = "bold"), 
        axis.title = element_text(size = 8), 
        axis.text = element_text(size = 7, angle = 45, hjust = 1, vjust = 1))
se_plot_svg <- plot_to_svg_base64(ts_all_SE, width = 3, height = 2.5)

#plot of the time series of all admissions in the South West Region
ts_all_SW <- df_all_SW %>% 
  ggplot(aes(x = Time_Period, y = Value)) +
  geom_point(color = "black") +
  geom_line(aes(group = "Area_Name"), color = "orange") +
  labs(title = "Admissions in the South West", 
       x = "Years", 
       y = "Number of Admissions (per 100,000)") +
  ylim(400, 800) +
  theme_classic() +
  theme(plot.title = element_text(size = 10, face = "bold"), 
        axis.title = element_text(size = 8), 
        axis.text = element_text(size = 7, angle = 45, hjust = 1, vjust = 1))
sw_plot_svg <- plot_to_svg_base64(ts_all_SW, width = 3, height = 2.5)

Finally, we could build the overall visualization. As a quick precursor, I set a vector to contain the colors I wanted for each region.

#begin by setting a vector of the colors used in the legend
colors_legend <- c("Increasing" = "darkred", 
                   "Decreasing" = "skyblue", 
                   "No Significant Change" = "orange")

And now the plot!

#plotting the map
final <- ggplot() +
  geom_sf_interactive(data = m_NE, size = .5, color = "black", aes(fill = "Increasing", tooltip = paste0("<b>", RGN23NM, "</b><br>", 
                                                                                                         "Change: Increasing<br><br>",
                                                                                                         '<img src="', ne_plot_svg, '" width="300" height="250" />'))) +
  geom_sf_interactive(data = m_NW, size = .5, color = "black", aes(fill = "Decreasing", tooltip = paste0("<b>", RGN23NM, "</b><br>", 
                                                                                                         "Change: Decreasing<br><br>",
                                                                                                         '<img src="', nw_plot_svg, '" width="300" height="250" />'))) +
  geom_sf_interactive(data = m_York, size = .5, color = "black", aes(fill = "No Significant Change", tooltip = paste0("<b>", RGN23NM, "</b><br>", 
                                                                                                                      "Change: No Significant Change<br><br>",
                                                                                                                      '<img src="', york_plot_svg, '" width="300" height="250" />'))) +
  geom_sf_interactive(data = m_EM, size = .5, color = "black", aes(fill = "No Significant Change", tooltip = paste0("<b>", RGN23NM, "</b><br>", 
                                                                                                                    "Change: No Significant Change<br><br>",
                                                                                                                    '<img src="', em_plot_svg, '" width="300" height="250" />'))) +
  geom_sf_interactive(data = m_WM, size = .5, color = "black", aes(fill = "Increasing", tooltip = paste0("<b>", RGN23NM, "</b><br>", 
                                                                                                         "Change: Increasing<br><br>",
                                                                                                         '<img src="', wm_plot_svg, '" width="300" height="250" />'))) +
  geom_sf_interactive(data = m_EE, size = .5, color = "black", aes(fill = "No Significant Change", tooltip = paste0("<b>", RGN23NM, "</b><br>", 
                                                                                                                    "Change: No Significant Change<br><br>",
                                                                                                                    '<img src="', ee_plot_svg, '" width="300" height="250" />'))) +
  geom_sf_interactive(data = m_Lon, size = .5, color = "black", aes(fill = "No Significant Change", tooltip = paste0("<b>", RGN23NM, "</b><br>", 
                                                                                                                     "Change: No Significant Change<br><br>",
                                                                                                                     '<img src="', lon_plot_svg, '" width="300" height="250" />'))) +
  geom_sf_interactive(data = m_SE, size = .5, color = "black", aes(fill = "Decreasing", tooltip = paste0("<b>", RGN23NM, "</b><br>", 
                                                                                                         "Change: Decreasing<br><br>",
                                                                                                         '<img src="', se_plot_svg, '" width="300" height="250" />'))) +
  geom_sf_interactive(data = m_SW, size = .5, color = "black", aes(fill = "No Significant Change", tooltip = paste0("<b>", RGN23NM, "</b><br>", 
                                                                                                                    "Change: No Significant Change<br><br>",
                                                                                                                    '<img src="', sw_plot_svg, '" width="300" height="250" />'))) +
  ggtitle("Changes in Alcohol-Specific Hospital Admissions Since 2016") +
  theme_void() +
  theme(panel.grid = element_line(color = "transparent")) +
  labs(fill = "Legend",
       subtitle = "Across England's Nine Regions") +
  scale_fill_manual(values = colors_legend) +
  theme(plot.title = element_text(size = 15, hjust = 0)) +
  theme(plot.subtitle = element_text(size = 10, hjust = 0.95)) +
  theme(legend.position = "right")

girafe(ggobj = final)

Summary

I have learned a lot through the creation of this project. The most guided section was tidying the data; we covered all of the functions and cleaning items during the lectures, which meant that I was quite ready when it came to tidying and transforming my data into something usable. The creation of the subplots was also fairly straightforward from the lecture material. On the other hand, creating map graphs was something I had to learn on my own, which was a bit of a learning curve. I think it was valuable especially in helping me know how to search online and how to read the information that other coders had put online. Similarly, this project aided me in learning how to use AI in diagnosing errors in my code and suggesting edits to produce desired outputs. If I were to have more time, I would consider creating a second plot looking at a different indicator to compare the two. For example, I might look at the data on alcohol-specific admissions for patients under the age of 18 and plot it side-by-side with the current graph as a comparative tool. Overall, I feel much more equipped in creating visualizations that are functional and interactive as well as more equipped to use the many tools and resources that support the actual coding process.